Story-to-Motion: Synthesizing Infinite and Controllable Character Animation from Long Text
Generating natural human motion from a story has the potential to transform
the landscape of animation, gaming, and film industries. A new and challenging
task, Story-to-Motion, arises when characters are required to move to various
locations and perform specific motions based on a long text description. This
task demands a fusion of low-level control (trajectories) and high-level
control (motion semantics). Previous works in character control and
text-to-motion have addressed related aspects, yet a comprehensive solution
remains elusive: character control methods do not handle text description,
whereas text-to-motion methods lack position constraints and often produce
unstable motions. In light of these limitations, we propose a novel system that
generates controllable, infinitely long motions and trajectories aligned with
the input text. (1) We leverage contemporary Large Language Models to act as a
text-driven motion scheduler to extract a series of (text, position, duration)
triplets from long text. (2) We develop a text-driven motion retrieval scheme that
incorporates motion matching with motion semantic and trajectory constraints.
(3) We design a progressive mask transformer that addresses common artifacts in
the transition motion such as unnatural pose and foot sliding. Beyond its
pioneering role as the first comprehensive solution for Story-to-Motion, our
system undergoes evaluation across three distinct sub-tasks: trajectory
following, temporal action composition, and motion blending, where it
outperforms previous state-of-the-art motion synthesis methods across the
board. Homepage: https://story2motion.github.io/. Comment: 8 pages, 6 figures
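The first component above, the LLM acting as a text-driven motion scheduler, lends itself to a brief illustration. The following is a minimal sketch of how such a scheduler could be wired up, not the authors' implementation: the prompt wording, the JSON schema, and the call_llm placeholder are all assumptions.

import json
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class ScheduleEntry:
    text: str                      # motion semantics, e.g. "walks to the desk"
    position: Tuple[float, float]  # target location on the ground plane
    duration: float                # seconds allotted to this segment

SCHEDULER_PROMPT = """You are a motion scheduler. Split the story below into an
ordered JSON list of objects with keys "text", "position" ([x, y] in meters),
and "duration" (seconds).

Story:
{story}
"""

def call_llm(prompt: str) -> str:
    """Placeholder for whatever chat/completion API is used; returns raw model text."""
    raise NotImplementedError

def schedule_motions(story: str) -> List[ScheduleEntry]:
    raw = call_llm(SCHEDULER_PROMPT.format(story=story))
    return [ScheduleEntry(e["text"], tuple(e["position"]), float(e["duration"]))
            for e in json.loads(raw)]

The downstream retrieval and blending stages described in the abstract would then consume these (text, position, duration) entries one at a time.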
IT3D: Improved Text-to-3D Generation with Explicit View Synthesis
Recent strides in Text-to-3D techniques have been propelled by distilling
knowledge from powerful large text-to-image diffusion models (LDMs).
Nonetheless, existing Text-to-3D approaches often grapple with challenges such
as over-saturation, inadequate detailing, and unrealistic outputs. This study
presents a novel strategy that leverages explicitly synthesized multi-view
images to address these issues. Our approach uses image-to-image pipelines,
empowered by LDMs, to generate posed high-quality images based on the
renderings of coarse 3D models. Although the generated
images mostly alleviate the aforementioned issues, challenges such as view
inconsistency and significant content variance persist due to the inherent
generative nature of large diffusion models, posing extensive difficulties in
leveraging these images effectively. To overcome this hurdle, we advocate
integrating a discriminator alongside a novel Diffusion-GAN dual training
strategy to guide the training of 3D models. For the incorporated
discriminator, the synthesized multi-view images are considered real data,
while the renderings of the optimized 3D models function as fake data. We
conduct a comprehensive set of experiments that demonstrate the effectiveness
of our method over baseline approaches. Comment: Project Page: https://github.com/buaacyw/IT3D-text-to-3
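To make the Diffusion-GAN dual training idea concrete, here is a minimal PyTorch-style sketch assuming a standard non-saturating GAN objective; the disc network and the exact loss form are illustrative choices, not the paper's formulation. As in the abstract, the synthesized multi-view images act as real data and the renderings of the optimized 3D model act as fake data.

import torch
import torch.nn.functional as F

def discriminator_loss(disc, synthesized_views, rendered_views):
    # LDM-synthesized multi-view images are treated as real data,
    # renderings of the 3D model being optimized as fake data.
    real_logits = disc(synthesized_views)
    fake_logits = disc(rendered_views.detach())
    ones, zeros = torch.ones_like(real_logits), torch.zeros_like(fake_logits)
    return (F.binary_cross_entropy_with_logits(real_logits, ones)
            + F.binary_cross_entropy_with_logits(fake_logits, zeros))

def generator_loss(disc, rendered_views):
    # Gradients flow back through the differentiable renderer into the 3D model,
    # pushing its renderings toward the synthesized views.
    fake_logits = disc(rendered_views)
    return F.binary_cross_entropy_with_logits(fake_logits, torch.ones_like(fake_logits))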
Variational Relational Point Completion Network for Robust 3D Classification
Real-scanned point clouds are often incomplete due to viewpoint, occlusion,
and noise, which hampers 3D geometric modeling and perception. Existing point
cloud completion methods tend to generate global shape skeletons and hence lack
fine local details. Furthermore, they mostly learn a deterministic
partial-to-complete mapping, but overlook structural relations in man-made
objects. To tackle these challenges, this paper proposes a variational
framework, Variational Relational point Completion Network (VRCNet) with two
appealing properties: 1) Probabilistic Modeling. In particular, we propose a
dual-path architecture to enable principled probabilistic modeling across
partial and complete clouds. One path consumes complete point clouds for
reconstruction by learning a point VAE. The other path generates complete
shapes for partial point clouds, whose embedded distribution is guided by the
distribution obtained from the reconstruction path during training. 2)
Relational Enhancement. Specifically, we carefully design a point self-attention
kernel and a point selective kernel module to exploit relational point features,
which refines local shape details conditioned on the coarse completion. In
addition, we contribute multi-view partial point cloud datasets (MVP and MVP-40
dataset) containing over 200,000 high-quality scans, which render partial 3D
shapes from 26 uniformly distributed camera poses for each 3D CAD model.
Extensive experiments demonstrate that VRCNet outperforms state-of-the-art
methods on all standard point cloud completion benchmarks. Notably, VRCNet
shows great generalizability and robustness on real-world point cloud scans.
Moreover, we can achieve robust 3D classification for partial point clouds with
the help of VRCNet, which can substantially increase classification accuracy.
Comment: 12 pages, 10 figures, accepted by PAMI. Project webpage:
https://mvp-dataset.github.io/. arXiv admin note: substantial text overlap
with arXiv:2104.1015
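A minimal sketch of the dual-path probabilistic modeling described above, assuming Gaussian latents and generic encoder/decoder callables (enc_complete, enc_partial, and decoder are placeholders, not VRCNet's actual modules): during training, the completion path's latent distribution is pulled toward the reconstruction path's distribution via a KL term.

import torch

def kl_between_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    # KL( N(mu_q, var_q) || N(mu_p, var_p) ), summed over latent dimensions.
    var_q, var_p = logvar_q.exp(), logvar_p.exp()
    kl = 0.5 * (logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)
    return kl.sum(dim=-1).mean()

def dual_path_loss(enc_complete, enc_partial, decoder, complete_pc, partial_pc, recon_loss):
    mu_p, logvar_p = enc_complete(complete_pc)   # reconstruction path (complete cloud)
    mu_q, logvar_q = enc_partial(partial_pc)     # completion path (partial cloud)
    z_q = mu_q + torch.randn_like(mu_q) * (0.5 * logvar_q).exp()  # reparameterization
    coarse = decoder(z_q, partial_pc)            # coarse completion from the partial input
    return recon_loss(coarse, complete_pc) + kl_between_gaussians(mu_q, logvar_q, mu_p, logvar_p)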
Learning Dense UV Completion for Human Mesh Recovery
Human mesh reconstruction from a single image is challenging in the presence
of occlusion, which can be caused by self, objects, or other humans. Existing
methods either fail to separate human features accurately or lack proper
supervision for feature completion. In this paper, we propose Dense Inpainting
Human Mesh Recovery (DIMR), a two-stage method that leverages dense
correspondence maps to handle occlusion. Our method utilizes a dense
correspondence map to separate visible human features and completes human
features on a structured dense UV map with an attention-based feature
completion module. We also design a feature inpainting training procedure that
guides the network to learn from unoccluded features. We evaluate our method on
several datasets and demonstrate its superior performance under heavily
occluded scenarios compared to other methods. Extensive experiments show that
our method clearly outperforms prior SOTA methods on heavily occluded images
and achieves comparable results on the standard benchmark (3DPW).
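The two-stage idea (separate visible features via a dense correspondence map, then complete them on a structured UV map with attention) can be sketched as below. Everything here is illustrative: the 64x64 UV resolution, the tensor layouts, and the generic transformer standing in for the attention-based completion module are assumptions, not the authors' released code.

import torch
import torch.nn as nn

def scatter_to_uv(features, uv_coords, visible, uv_size=64):
    """features: (N, C) per-pixel features; uv_coords: (N, 2) in [0, 1];
    visible: (N,) bool mask of pixels that land on visible body surface."""
    C = features.shape[1]
    uv_map = torch.zeros(C, uv_size, uv_size)
    idx = (uv_coords[visible] * (uv_size - 1)).long()
    uv_map[:, idx[:, 1], idx[:, 0]] = features[visible].t()
    return uv_map  # partial UV feature map; occluded cells stay zero

class UVCompletion(nn.Module):
    # Generic transformer encoder standing in for the attention-based module
    # that fills the zeroed (occluded) UV cells; dim must match the feature channels.
    def __init__(self, dim=256):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, uv_map):                       # uv_map: (B, C, H, W)
        B, C, H, W = uv_map.shape
        tokens = uv_map.flatten(2).transpose(1, 2)   # (B, H*W, C)
        return self.encoder(tokens).transpose(1, 2).reshape(B, C, H, W)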